Abstract

Much of the power of probabilistic methods in 
modelling language comes from their ability to 
compare several derivations for the same string 
in the language. An important starting point 
for the study of such cross-derivational proper- 
ties is the notion of consistency. The probabil- 
ity model defined by a probabilistic grammar is 
said to be consistent if the probabilities assigned 
to all the strings in the language sum to one. 
From the literature on probabilistic context-free 
grammars (CFGs), we know precisely the con- 
ditions which ensure that consistency is true for 
a given CFG. This paper derives the conditions 
under which a given probabilistic Tree Adjoin- 
ing Grammar (TAG) can be shown to be con- 
sistent. It gives a simple algorithm for checking 
consistency and gives the formal justification 
for its correctness. The conditions derived here 
can be used to ensure that probability models 
that use TAGs can be checked for deficiency 
(i.e. whether any probability mass is assigned 
to strings that cannot be generated). 

1 Introduction 

Much of the power of probabilistic methods 
in modelling language comes from their abil- 
ity to compare several derivations for the same 
string in the language. This cross-derivational 
power arises naturally from comparison of vari- 
ous derivational paths, each of which is a prod- 
uct of the probabilities associated with each step 
in each derivation. A common approach used 
to assign structure to language is to use a prob- 
abilistic grammar where each elementary rule 
or production is associated with a probability.
Using such a grammar, a probability for each 
string in the language is computed. Assum- 
ing that the probability of each derivation of a 
sentence is well-defined, the probability of each 
string in the language is simply the sum of the 
probabilities of all derivations of the string. In 
general, for a probabilistic grammar G the lan- 
guage of G is denoted by L(G). Then if a string 
v is in the language L(G) the probabilistic gram- 
mar assigns v some non-zero probability. 

There are several cross-derivational proper- 
ties that can be studied for a given probabilis- 
tic grammar formalism. An important starting 
point for such studies is the notion of consis- 
tency. The probability model defined by a prob- 
abilistic grammar is said to be consistent if the 
probabilities assigned to all the strings in the 
language sum to 1. That is, if Pr, defined by a
probabilistic grammar, assigns a probability to
each string $v \in \Sigma^*$, where $\Pr(v) = 0$ if $v \notin L(G)$,
then

$$\sum_{v \in L(G)} \Pr(v) = 1 \tag{1}$$



From the literature on probabilistic context- 
free grammars (CFGs) we know precisely the 
conditions which ensure that (1) is true for a
given CFG. This paper derives the conditions 
under which a given probabilistic TAG can be 
shown to be consistent. 

TAGs are important in the modelling of nat- 
ural language since they can be easily lexical- 
ized; moreover the trees associated with words 
can be used to encode argument and adjunct re- 
lations in various syntactic environments. This 
paper assumes some familiarity with the TAG
formalism. (Joshi, 1988) and (Joshi and Schabes,
1992) are good introductions to the formalism
and its linguistic relevance. TAGs have been
shown to have relations with both phrase-
structure grammars and dependency grammars
(Rambow and Joshi, 1995) and can handle
(non-projective) long distance dependencies.

... nar, Giorgio Satta, B. Srinivas, Fei Xia and the two
anonymous reviewers for their valuable comments.

Consistency of probabilistic TAGs has prac- 
tical significance for the following reasons: 

• The conditions derived here can be used 
to ensure that probability models that use 
TAGs can be checked for deficiency. 

• Existing EM based estimation algorithms
for probabilistic TAGs assume that the
property of consistency holds (Schabes,
1992). EM based algorithms begin with an
initial (usually random) value for each
parameter. If the initial assignment causes
the grammar to be inconsistent, then
iterative re-estimation might converge to an
inconsistent grammar.¹

• Techniques used in this paper can be used
to determine consistency for other
probability models based on TAGs (Carroll and
Weir, 1997).



2 Notation 

In this section we establish some notational con- 
ventions and definitions that we use in this pa- 
per. Those familiar with the TAG formalism 
only need to give a cursory glance through this 
section. 

A probabilistic TAG is represented by
$(N, \Sigma, I, A, S, \phi)$ where $N$, $\Sigma$ are, respectively,
non-terminal and terminal symbols. $I \cup A$ is a
set of trees termed as elementary trees. We take
$V$ to be the set of all nodes in all the elementary
trees. For each leaf $A \in V$, label($A$) is an
element from $\Sigma \cup \{\epsilon\}$, and for each other node $A$,
label($A$) is an element from $N$. $S$ is an element
from $N$ which is a distinguished start symbol.
The root node $A$ of every initial tree which can
start a derivation must have label($A$) $= S$.

$I$ are termed initial trees and $A$ are auxiliary
trees which can rewrite a tree node $A \in V$.
This rewrite step is called adjunction. $\phi$ is a
function which assigns each adjunction with a
probability and denotes the set of parameters
in the model. In practice, TAGs also allow
leaf nodes $A$ such that label($A$) is an element
from $N$. Such nodes $A$ are rewritten with initial
trees from $I$ using the rewrite step called
substitution. Except in one special case, we
will not need to treat substitution as being
distinct from adjunction.

1 Note that for CFGs it has been shown in (Chaudhari
et al., 1983; Sanchez and Benedi, 1997) that inside-
outside reestimation can be used to avoid inconsistency.
We will show later in the paper that the method used to
show consistency in this paper precludes a straightforward
extension of that result for TAGs.

For $t \in I \cup A$, $A(t)$ are the nodes in tree
$t$ that can be modified by adjunction. For
label($A$) $\in N$ we denote Adj(label($A$)) as the
set of trees that can adjoin at node $A \in V$.
The adjunction of $t$ into $N \in V$ is denoted by
$N \mapsto t$. No adjunction at $N \in V$ is denoted
by $N \mapsto nil$. We assume the following properties
hold for every probabilistic TAG $G$ that we
consider:

1. $G$ is lexicalized. There is at least one
leaf node $a$ that lexicalizes each elementary
tree, i.e. $a \in \Sigma$.

2. $G$ is proper. For each $N \in V$,

$$\phi(N \mapsto nil) + \sum_{t} \phi(N \mapsto t) = 1$$



3. Adjunction is prohibited on the foot node 
of every auxiliary tree. This condition is 
imposed to avoid unnecessary ambiguity 
and can be easily relaxed. 

4. There is a distinguished non-lexicalized
initial tree $\tau$ such that each initial tree rooted
by a node $A$ with label($A$) $= S$ substitutes
into $\tau$ to complete the derivation. This
ensures that probabilities assigned to the
input string at the start of the derivation are
well-formed.

We use symbols $S, A, B, \ldots$ to range over $V$,
symbols $a, b, c, \ldots$ to range over $\Sigma$. We use
$t_1, t_2, \ldots$ to range over $I \cup A$ and $\epsilon$ to denote
the empty string. We use $X_i$ to range over all $i$
nodes in the grammar.

3 Applying probability measures to 
Tree Adjoining Languages 

To gain some intuition about probability assign- 
ments to languages, let us take for example, a 
language well known to be a tree adjoining lan- 
guage: 

$$L(G) = \{ a^n b^n c^n d^n \mid n > 1 \}$$



It seems that we should be able to use a function
$\psi$ to assign any probability distribution to
the strings in $L(G)$ and then expect that we can
assign appropriate probabilities to the adjunctions
in $G$ such that the language generated by
$G$ has the same distribution as that given by
$\psi$. However, a function $\psi$ that grows smaller
by repeated multiplication as the inverse of an
exponential function cannot be matched by any
TAG because of the constant growth property of
TAGs (see (Vijay-Shanker, 1987), p. 104). An
example of such a function $\psi$ is a simple Poisson
distribution (2), which in fact was also used
as the counterexample in (Booth and Thompson,
1973) for CFGs, since CFGs also have the
constant growth property.

$$\psi(a^n b^n c^n d^n) = \frac{1}{e \cdot n!} \tag{2}$$



This shows that probabilistic TAGs, like CFGs, 
are constrained in the probabilistic languages 
that they can recognize or learn. As shown 
above, a probabilistic language can fail to have 
a generating probabilistic TAG. 

The reverse is also true: some probabilis- 
tic TAGs, like some CFGs, fail to have a 
corresponding probabilistic language, i.e. they 
are not consistent. There are two reasons 
why a probabilistic TAG could be inconsistent: 
"dirty" grammars, and destructive or incorrect 
probability assignments. 

"Dirty" grammars. Usually, when applied 
to language, TAGs are lexicalized and so prob- 
abilities assigned to trees are used only when 
the words anchoring the trees are used in a 
derivation. However, if the TAG allows non- 
lexicalized trees, or more precisely, auxiliary 
trees with no yield, then looping adjunctions 
which never generate a string are possible. How- 
ever, this can be detected and corrected by a 
simple search over the grammar. Even in lexi- 
calized grammars, there could be some auxiliary 
trees that are assigned some probability mass 
but which can never adjoin into another tree. 
Such auxiliary trees are termed unreachable and 
techniques similar to the ones used in detecting 
unreachable productions in CFGs can be used 
here to detect and eliminate such trees. 

Destructive probability assignments. 
This problem is a more serious one, and is the 
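main subject of this paper. Consider the
probabilistic TAG shown in (3).²

[Figure: elementary trees $t_1$ (containing node $S_1$) and $t_2$ (containing nodes $S_2$ and $S_3$).]

$$
\begin{array}{ll}
\phi(S_1 \mapsto t_2) = 1.0 & \\
\phi(S_2 \mapsto t_2) = 0.99 & \phi(S_2 \mapsto nil) = 0.01 \\
\phi(S_3 \mapsto t_2) = 0.98 & \phi(S_3 \mapsto nil) = 0.02
\end{array}
\tag{3}
$$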



Consider a derivation in this TAG as a generative
process. It proceeds as follows: node $S_1$ in
$t_1$ is rewritten as $t_2$ with probability 1.0. Node
$S_2$ in $t_2$ is 99 times more likely than not to be
rewritten as $t_2$ itself, and similarly node $S_3$ is 49
times more likely than not to be rewritten as $t_2$.
This, however, creates two more instances of $S_2$
and $S_3$ with the same probabilities. This continues,
creating multiple instances of $t_2$ at each level of
the derivation process, with each instance of $t_2$
creating two more instances of itself. The grammar
itself is not malicious; the probability
assignments are to blame. It is important to note
that inconsistency is a problem even though for
any given string there are only a finite number
of derivations, all halting. Consider the
probability mass function (pmf) over the set of all
derivations for this grammar. An inconsistent
grammar would have a pmf which assigns a large
portion of probability mass to derivations that
are non-terminating. This means there is a
non-zero probability that the generative process
enters a generation sequence that never terminates.
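The size of the lost probability mass can be estimated numerically. The sketch below is not part of the original argument: it assumes the standard branching-process fact that the termination probability is the smallest non-negative fixed point of the offspring generating function, and the function name `termination_probability` is hypothetical. Since $S_1$ is rewritten as $t_2$ with probability 1.0, a derivation halts exactly when the subtree below that first $t_2$ is finite; writing $q$ for that probability, $q = (0.01 + 0.99q)(0.02 + 0.98q)$.

```python
def termination_probability(p_s2=0.99, p_s3=0.98, iterations=200):
    """Smallest non-negative solution q of
    q = ((1 - p_s2) + p_s2 * q) * ((1 - p_s3) + p_s3 * q),
    found by iterating the map starting from q = 0.

    p_s2 and p_s3 are the probabilities that nodes S2 and S3 of one
    instance of t2 are rewritten by a further adjunction of t2.
    """
    q = 0.0
    for _ in range(iterations):
        # Both nodes of this t2 instance must lead to finite subtrees.
        q = ((1.0 - p_s2) + p_s2 * q) * ((1.0 - p_s3) + p_s3 * q)
    return q


q = termination_probability()
print(f"probability that a derivation halts: {q:.6f}")
```

Under these assumptions only about 0.02% of the probability mass falls on halting derivations, which is the sense in which the assignment in (3) is destructive.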

4 Conditions for Consistency 

A probabilistic TAG $G$ is consistent if and only
if:

$$\sum_{v \in L(G)} \Pr(v) = 1 \tag{4}$$



where Pr(v) is the probability assigned to a 
string in the language. If a grammar G does 
not satisfy this condition, G is said to be incon- 
sistent. 

To explain the conditions under which a
probabilistic TAG is consistent we will use the TAG
in (5) as an example.

2 The subscripts are used as a simple notation to
uniquely refer to the nodes in each elementary tree. They
are not part of the node label for purposes of adjunction.



[Figure: elementary trees $t_1$, $t_2$ and $t_3$. The nodes open to adjunction are $A_1$ in $t_1$; $A_2$, $B_1$ and $A_3$ in $t_2$; and $B_2$ in $t_3$.]

$$
\begin{array}{ll}
\phi(A_1 \mapsto t_2) = 0.8 & \phi(A_1 \mapsto nil) = 0.2 \\
\phi(A_2 \mapsto t_2) = 0.2 & \phi(A_2 \mapsto nil) = 0.8 \\
\phi(B_1 \mapsto t_3) = 0.2 & \phi(B_1 \mapsto nil) = 0.8 \\
\phi(A_3 \mapsto t_2) = 0.4 & \phi(A_3 \mapsto nil) = 0.6 \\
\phi(B_2 \mapsto t_3) = 0.1 & \phi(B_2 \mapsto nil) = 0.9
\end{array}
\tag{5}
$$

From this grammar, we compute a square matrix
$M$ of size $|V|$, where $V$ is the set
of nodes in the grammar that can be rewritten
by adjunction. Each $M_{ij}$ contains the
expected value of obtaining node $X_j$ when node
$X_i$ is rewritten by adjunction at each level of a
TAG derivation. We call $M$ the stochastic
expectation matrix associated with a probabilistic
TAG.

To get $M$ for a grammar we first write a matrix
$P$ which has $|V|$ rows and $|I \cup A|$ columns.
An element $P_{ij}$ corresponds to the probability
of adjoining tree $t_j$ at node $X_i$, i.e. $\phi(X_i \mapsto t_j)$.³

$$
P = \begin{array}{c|ccc}
    & t_1 & t_2 & t_3 \\ \hline
A_1 & 0 & 0.8 & 0 \\
A_2 & 0 & 0.2 & 0 \\
B_1 & 0 & 0 & 0.2 \\
A_3 & 0 & 0.4 & 0 \\
B_2 & 0 & 0 & 0.1
\end{array}
$$

3 Note that $P$ is not a row stochastic matrix. This
is an important difference in the construction of $M$ for
TAGs when compared to CFGs. We will return to this
point in §5.

We then write a matrix $N$ which has $|I \cup A|$
rows and $|V|$ columns. An element $N_{ij}$ is 1.0 if
node $X_j$ is a node in tree $t_i$.

$$
N = \begin{array}{c|ccccc}
    & A_1 & A_2 & B_1 & A_3 & B_2 \\ \hline
t_1 & 1.0 & 0 & 0 & 0 & 0 \\
t_2 & 0 & 1.0 & 1.0 & 1.0 & 0 \\
t_3 & 0 & 0 & 0 & 0 & 1.0
\end{array}
$$

Then the stochastic expectation matrix $M$ is
simply the product of these two matrices.

$$
M = P \cdot N = \begin{array}{c|ccccc}
    & A_1 & A_2 & B_1 & A_3 & B_2 \\ \hline
A_1 & 0 & 0.8 & 0.8 & 0.8 & 0 \\
A_2 & 0 & 0.2 & 0.2 & 0.2 & 0 \\
B_1 & 0 & 0 & 0 & 0 & 0.2 \\
A_3 & 0 & 0.4 & 0.4 & 0.4 & 0 \\
B_2 & 0 & 0 & 0 & 0 & 0.1
\end{array}
$$

Inspecting the values of $M$ in terms of the
grammar probabilities indicates that $M_{ij}$
contains the values we wanted, i.e. the expectation of
obtaining node $X_j$ when node $X_i$ is rewritten by
adjunction at each level of the TAG derivation
process.

By construction we have ensured that the
following theorem from (Booth and Thompson,
1973) applies to probabilistic TAGs. A
formal justification for this claim is given in
the next section by showing a reduction of the
TAG derivation process to a multitype Galton-
Watson branching process (Harris, 1963).

Theorem 4.1 A probabilistic grammar is consistent
if the spectral radius $\rho(M) < 1$, where
$M$ is the stochastic expectation matrix computed
from the grammar. (Booth and Thompson,
1973; Soule, 1974)
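The construction of the stochastic expectation matrix as the product $M = P \cdot N$ can be sketched in a few lines. This is only an illustration for the example grammar (5): the node order $A_1, A_2, B_1, A_3, B_2$ and tree order $t_1, t_2, t_3$ follow the text, while the helper name `matmul` is a hypothetical choice and plain lists stand in for a matrix library.

```python
# Rows of P are the nodes A1, A2, B1, A3, B2; columns are the trees t1, t2, t3.
# P[i][j] is the probability of adjoining tree j at node i.
P = [
    [0.0, 0.8, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2],
    [0.0, 0.4, 0.0],
    [0.0, 0.0, 0.1],
]

# Rows of N are the trees t1, t2, t3; columns are the nodes A1, A2, B1, A3, B2.
# N[i][j] is 1.0 if node j occurs in tree i.
N = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Stochastic expectation matrix: |V| x |V|, indexed by nodes on both axes.
M = matmul(P, N)
for name, row in zip(["A1", "A2", "B1", "A3", "B2"], M):
    print(name, row)
```

Each row of the result lists, for one node, the expected number of copies of every node introduced when that node is rewritten once.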



This theorem provides a way to determine
whether a grammar is consistent. All we need to
do is compute the spectral radius of the square
matrix $M$, which is equal to the modulus of the
largest eigenvalue of $M$. If this value is less than
one then the grammar is consistent.⁴ Computing
consistency can bypass the computation of
the eigenvalues for $M$ by using the following
theorem by Geršgorin (see (Horn and Johnson,
1985; Wetherell, 1980)).



Theorem 4.2 For any square matrix $M$,
$\rho(M) < 1$ if and only if there is an $n \geq 1$
such that the sum of the absolute values of
the elements of each row of $M^n$ is less than
one. Moreover, any $n' > n$ also has this property.
(Geršgorin, see (Horn and Johnson, 1985;
Wetherell, 1980))



4 The grammar may be consistent when the spectral
radius is exactly one, but this case involves many special
considerations and is not considered in this paper. In
practice, these complicated tests are probably not worth
the effort. See (Harris, 1963) for details on how this
special case can be solved.



This makes for a very simple algorithm to
check consistency of a grammar. We sum the
values of the elements of each row of the stochastic
expectation matrix $M$ computed from the
grammar. If any of the row sums are greater
than one then we compute $M^2$, repeat the test
and compute $M^{2^2}$ if the test fails, and so on
until the test succeeds.⁵ The algorithm does not
halt if $\rho(M) \geq 1$. In practice, such an algorithm
works better in the average case since computation
of eigenvalues is more expensive for very
large matrices. An upper bound can be set on
the number of iterations in this algorithm. Once
the bound is passed, the exact eigenvalues can
be computed.
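The row-sum test with repeated squaring can be sketched as follows. The function names and the default bound `max_squarings=8` are assumptions made for this illustration, not values taken from the text; a result of `None` means the bound was reached and the eigenvalues would have to be computed by other means.

```python
def matmul(a, b):
    """Plain matrix product of two square list-of-lists matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_row_sum(m):
    """Largest sum of absolute values over the rows of m."""
    return max(sum(abs(x) for x in row) for row in m)


def consistency_power(m, max_squarings=8):
    """Return the first power n in 1, 2, 4, 8, ... such that every row sum
    of m**n is below one, or None if no such power is found within the
    bound. A small bound also keeps the entries of a divergent matrix
    within floating point range.
    """
    power, current = 1, m
    for step in range(max_squarings + 1):
        if max_row_sum(current) < 1.0:
            return power
        if step < max_squarings:
            current = matmul(current, current)  # M^n -> M^(2n)
            power *= 2
    return None


# Expectation matrix of grammar (5), node order A1, A2, B1, A3, B2.
M5 = [
    [0.0, 0.8, 0.8, 0.8, 0.0],
    [0.0, 0.2, 0.2, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.2],
    [0.0, 0.4, 0.4, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1],
]
# Expectation matrix of grammar (3), node order S1, S2, S3.
M3 = [
    [0.0, 1.0, 1.0],
    [0.0, 0.99, 0.99],
    [0.0, 0.98, 0.98],
]

print(consistency_power(M5))
print(consistency_power(M3))
```

For grammar (5) the test first succeeds at the fourth power, while for grammar (3) it never succeeds within the bound.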

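For the grammar in (5) we computed the following
stochastic expectation matrix:

$$
M = \begin{pmatrix}
0 & 0.8 & 0.8 & 0.8 & 0 \\
0 & 0.2 & 0.2 & 0.2 & 0 \\
0 & 0 & 0 & 0 & 0.2 \\
0 & 0.4 & 0.4 & 0.4 & 0 \\
0 & 0 & 0 & 0 & 0.1
\end{pmatrix}
$$

The first row sum is 2.4. Since the sum of
each row must be less than one, we compute the
power matrix $M^2$. However, the sum of one of
the rows is still greater than 1. Continuing we
compute $M^{2^2}$.

$$
M^{2^2} = \begin{pmatrix}
0 & 0.1728 & 0.1728 & 0.1728 & 0.0688 \\
0 & 0.0432 & 0.0432 & 0.0432 & 0.0172 \\
0 & 0 & 0 & 0 & 0.0002 \\
0 & 0.0864 & 0.0864 & 0.0864 & 0.0344 \\
0 & 0 & 0 & 0 & 0.0001
\end{pmatrix}
$$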



This time all the row sums are less than one,
hence $\rho(M) < 1$. So we can say that the grammar
defined in (5) is consistent. We can confirm
this by computing the eigenvalues for $M$ which
are 0, 0, 0.6, 0 and 0.1, all less than 1.

Now consider the grammar (3) we had considered
in Section 3. The value of $M$ for that
grammar is computed to be:

$$
M = \begin{array}{c|ccc}
    & S_1 & S_2 & S_3 \\ \hline
S_1 & 0 & 1.0 & 1.0 \\
S_2 & 0 & 0.99 & 0.99 \\
S_3 & 0 & 0.98 & 0.98
\end{array}
$$

5 We compute $M^{2^2}$ and subsequently only successive
powers of 2 because Theorem 4.2 holds for any $n' > n$.
This permits us to use a single matrix at each step in
the algorithm.

The eigenvalues for the expectation matrix
$M$ computed for the grammar (3) are 0, 1.97
and 0. The largest eigenvalue is greater than
1 and this confirms (3) to be an inconsistent
grammar.
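The value 1.97 can be checked by hand. The short derivation below is an added illustration, using only the fact that every row of this matrix is a multiple of $(0, 1, 1)$, so the matrix has rank one and its single non-zero eigenvalue equals its trace:

$$
M = \begin{pmatrix} 1.0 \\ 0.99 \\ 0.98 \end{pmatrix}
    \begin{pmatrix} 0 & 1 & 1 \end{pmatrix},
\qquad
\rho(M) = \mathrm{tr}(M) = 0 + 0.99 + 0.98 = 1.97
$$

The remaining two eigenvalues of a rank-one $3 \times 3$ matrix are 0, in agreement with the values quoted above.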



5 TAG Derivations and Branching
Processes



To show that Theorem 4.1 in Section 4 holds
for any probabilistic TAG, it is sufficient to show
that the derivation process in TAGs is a Galton-
Watson branching process.

A Galton-Watson branching process (Harris,
1963) is simply a model of processes that have
objects that can produce additional objects of
the same kind, i.e. recursive processes, with certain
properties. There is an initial set of objects
in the 0-th generation which produces with
some probability a first generation which in turn
with some probability generates a second, and
so on. We will denote by vectors $Z_0, Z_1, Z_2, \ldots$
the 0-th, first, second, ... generations. There
are two assumptions made about $Z_0, Z_1, Z_2, \ldots$:



1. The size of the $n$-th generation does not
influence the probability with which any of
the objects in the $(n + 1)$-th generation is
produced. In other words, $Z_0, Z_1, Z_2, \ldots$
form a Markov chain.



2. The number of objects born to a parent 
object does not depend on how many other 
objects are present at the same level. 



We can associate a generating function for
each level $Z_i$. The value for the vector $Z_n$ is the
value assigned by the $n$-th iterate of this generating
function. The expectation matrix $M$ is
defined using this generating function.

The theorem attributed to Galton and Watson
specifies the conditions for the probability
of extinction of a family starting from its 0-th
generation, assuming the branching process
represents a family tree (i.e. respecting the
conditions outlined above). The theorem states that
$\rho(M) < 1$ when the probability of extinction is
1.0.



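[Figure (6): derivation tree. Level 0: t1. Level 1: t2 (0). Level 2: t2 (0), t3 (1), t2 (1.1). Level 3: t2 (1.1), t3 (0). Level 4: no further adjunction.]

[Figure (7): parse tree derived from the derivation in (6).]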



The assumptions made about the generating
process intuitively hold for probabilistic TAGs.
(6), for example, depicts a derivation of the
string $a_2 a_2 a_2 a_2 a_3 a_3 a_1$ by a sequence of adjunctions
in the grammar given in (5).⁶ The parse
tree derived from such a sequence is shown in
Fig. (7). In the derivation tree (6), nodes in the
trees at each level $i$ are rewritten by adjunction
to produce a level $i + 1$. There is a final level 4
in (6) since we also consider the probability that
a node is not rewritten further, i.e. $\phi(A \mapsto nil)$
for each node $A$.

We give a precise statement of a TAG derivation
process by defining a generating function
for the levels in a derivation tree. Each level
$i$ in the TAG derivation tree then corresponds
to $Z_i$ in the Markov chain of branching processes.
This is sufficient to justify the use of
Theorem 4.1 in Section 4. The conditions on
the probability of extinction then relate to the
probability that TAG derivations for a probabilistic
TAG will not recurse infinitely. Hence
the probability of extinction is the same as the
probability that a probabilistic TAG is consistent.

6 The numbers in parentheses next to the tree names
are node addresses where each tree has adjoined into
its parent. Recall the definition of node addresses in
Section 2.

For each $X_j \in V$, where $V$ is the set of nodes
in the grammar where adjunction can occur,
we define the $k$-argument adjunction generating
function over variables $s_1, \ldots, s_k$ corresponding
to the $k$ nodes in $V$.

$$
g_j(s_1, \ldots, s_k) = \sum_{t \in Adj(X_j) \cup \{nil\}} \phi(X_j \mapsto t) \cdot s_1^{r_1(t)} \cdots s_k^{r_k(t)}
$$

where $r_j(t) = 1$ iff node $X_j$ is in tree $t$, $r_j(t) = 0$
otherwise.

For example, for the grammar in (5) we get
the following adjunction generating functions
taking the variables $s_1, s_2, s_3, s_4, s_5$ to represent
the nodes $A_1, A_2, B_1, A_3, B_2$ respectively.

$$
\begin{aligned}
g_1(s_1, \ldots, s_5) &= \phi(A_1 \mapsto t_2) \cdot s_2 \cdot s_3 \cdot s_4 + \phi(A_1 \mapsto nil) \\
g_2(s_1, \ldots, s_5) &= \phi(A_2 \mapsto t_2) \cdot s_2 \cdot s_3 \cdot s_4 + \phi(A_2 \mapsto nil) \\
g_3(s_1, \ldots, s_5) &= \phi(B_1 \mapsto t_3) \cdot s_5 + \phi(B_1 \mapsto nil) \\
g_4(s_1, \ldots, s_5) &= \phi(A_3 \mapsto t_2) \cdot s_2 \cdot s_3 \cdot s_4 + \phi(A_3 \mapsto nil) \\
g_5(s_1, \ldots, s_5) &= \phi(B_2 \mapsto t_3) \cdot s_5 + \phi(B_2 \mapsto nil)
\end{aligned}
$$

The $n$-th level generating function
$G_n(s_1, \ldots, s_k)$ is defined recursively as follows.

$$
\begin{aligned}
G_0(s_1, \ldots, s_k) &= s_1 \\
G_1(s_1, \ldots, s_k) &= g_1(s_1, \ldots, s_k) \\
G_n(s_1, \ldots, s_k) &= G_{n-1}[g_1(s_1, \ldots, s_k), \ldots, g_k(s_1, \ldots, s_k)]
\end{aligned}
$$

For the grammar in (5) we get the following
level generating functions.

$$
\begin{aligned}
G_0(s_1, \ldots, s_5) &= s_1 \\
G_1(s_1, \ldots, s_5) &= g_1(s_1, \ldots, s_5) \\
 &= \phi(A_1 \mapsto t_2) \cdot s_2 \cdot s_3 \cdot s_4 + \phi(A_1 \mapsto nil) \\
 &= 0.8 \cdot s_2 \cdot s_3 \cdot s_4 + 0.2 \\
G_2(s_1, \ldots, s_5) &= \phi(A_1 \mapsto t_2) [g_2(s_1, \ldots, s_5)] [g_3(s_1, \ldots, s_5)] [g_4(s_1, \ldots, s_5)] + \phi(A_1 \mapsto nil) \\
 &= 0.0128 s_2^2 s_3^2 s_4^2 s_5 + 0.0512 s_2^2 s_3^2 s_4^2 + 0.0704 s_2 s_3 s_4 s_5 \\
 &\quad + 0.2816 s_2 s_3 s_4 + 0.0768 s_5 + 0.5072
\end{aligned}
$$



Examining this example, we can express
$G_i(s_1, \ldots, s_k)$ as a sum $D_i(s_1, \ldots, s_k) + C_i$,
where $C_i$ is a constant and $D_i(\cdot)$ is a polynomial
with no constant terms. A probabilistic
TAG will be consistent if these recursive equations
terminate, i.e. iff

$$\lim_{i \to \infty} D_i(s_1, \ldots, s_k) \to 0$$
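The behaviour of the constant term can be observed numerically for the example grammar (5). The sketch below is an illustration under the definitions above: the helper names are hypothetical, the coefficients are the adjunction probabilities of (5), and $C_n$ is obtained as $G_n(0, \ldots, 0)$, so that $D_n(1, \ldots, 1) = 1 - C_n$.

```python
def g(s):
    """Adjunction generating functions g_1..g_5 of grammar (5); the
    variables s1..s5 stand for the nodes A1, A2, B1, A3, B2."""
    s1, s2, s3, s4, s5 = s
    return (
        0.8 * s2 * s3 * s4 + 0.2,  # g_1: A1 -> t2 or nil
        0.2 * s2 * s3 * s4 + 0.8,  # g_2: A2 -> t2 or nil
        0.2 * s5 + 0.8,            # g_3: B1 -> t3 or nil
        0.4 * s2 * s3 * s4 + 0.6,  # g_4: A3 -> t2 or nil
        0.1 * s5 + 0.9,            # g_5: B2 -> t3 or nil
    )


def level_generating_function(n, s):
    """G_n(s), using G_0(s) = s_1 and G_n(s) = G_{n-1}(g_1(s), ..., g_5(s))."""
    for _ in range(n):
        s = g(s)
    return s[0]


def constant_term(n):
    """C_n = G_n(0, ..., 0): probability mass of the derivations that
    have already stopped by level n."""
    return level_generating_function(n, (0.0,) * 5)


for n in (1, 2, 5, 10, 20, 60):
    print(n, constant_term(n), 1.0 - constant_term(n))
```

For this grammar $C_n$ climbs towards 1 and the remainder $1 - C_n$ shrinks towards 0, which is the limit condition holding numerically.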







We can rewrite the level generation functions in
terms of the stochastic expectation matrix $M$,
where each element $M_{ij}$ of $M$ is computed as
follows (cf. (Booth and Thompson, 1973)).

$$
M_{ij} = \left. \frac{\partial g_i(s_1, \ldots, s_k)}{\partial s_j} \right|_{s_1, \ldots, s_k = 1} \tag{8}
$$
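As a cross-check of (8) on the example grammar (5), the partial derivatives can be approximated numerically and compared with the matrix obtained from $P \cdot N$. Central finite differences are a stand-in chosen for this sketch, not the method of the paper, and the function names and step size `h` are assumptions.

```python
def g(s):
    """Adjunction generating functions g_1..g_5 of grammar (5); the
    variables s1..s5 stand for the nodes A1, A2, B1, A3, B2."""
    s1, s2, s3, s4, s5 = s
    return (
        0.8 * s2 * s3 * s4 + 0.2,
        0.2 * s2 * s3 * s4 + 0.8,
        0.2 * s5 + 0.8,
        0.4 * s2 * s3 * s4 + 0.6,
        0.1 * s5 + 0.9,
    )


def expectation_matrix(gen, k, h=1e-6):
    """Approximate M[i][j] = d g_i / d s_j at s_1 = ... = s_k = 1 with
    central differences. Each g_i here is linear in every single
    variable, so the only error is floating point rounding."""
    m = [[0.0] * k for _ in range(k)]
    for j in range(k):
        up = [1.0] * k
        down = [1.0] * k
        up[j] += h
        down[j] -= h
        g_up, g_down = gen(up), gen(down)
        for i in range(k):
            m[i][j] = (g_up[i] - g_down[i]) / (2.0 * h)
    return m


M = expectation_matrix(g, 5)
for name, row in zip(["A1", "A2", "B1", "A3", "B2"], M):
    print(name, [round(x, 4) for x in row])
```

Note that the constants $\phi(X \mapsto nil)$ vanish under differentiation, so they leave no trace in the resulting matrix.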



The limit condition above translates to the
condition that the spectral radius of $M$ must be
less than 1 for the grammar to be consistent.

This shows that Theorem 4.1, used in Section 4
to give an algorithm to detect inconsistency
in a probabilistic TAG, holds for any given TAG,
hence demonstrating the correctness of the
algorithm.

Note that the formulation of the adjunction
generating function means that the values for
$\phi(X \mapsto nil)$ for all $X \in V$ do not appear in
the expectation matrix. This is a crucial difference
between the test for consistency in TAGs
as compared to CFGs. For CFGs, the expectation
matrix for a grammar $G$ can be interpreted
as the contribution of each non-terminal to the
derivations for a sample set of strings drawn
from $L(G)$. Using this it was shown in (Chaudhari
et al., 1983) and (Sanchez and Benedi,
1997) that a single step of the inside-outside
algorithm implies consistency for a probabilistic
CFG. However, in the TAG case, the inclusion
of values for $\phi(X \mapsto nil)$ (which is essential
if we are to interpret the expectation matrix
in terms of derivations over a sample set of
strings) means that we cannot use the method
used in (8) to compute the expectation matrix
and furthermore the limit condition will not be
convergent.

6 Conclusion 

We have shown in this paper the conditions 
under which a given probabilistic TAG can be 
shown to be consistent. We gave a simple al- 
gorithm for checking consistency and gave the 
formal justification for its correctness. The re- 
sult is practically significant for its applications 
in checking for deficiency in probabilistic TAGs. 